Content-Oriented Categorization of Document Images
نویسنده
چکیده
We have developed a technique that categorizes document images based on their content. Unlike conventional methods that use optical character recognition (OCR), we convert document images into word shape tokens, a shape-based representation of words. Because we have only to recognize simple graphical features from image, this process is much faster than OCR. Although the mapping between word shape tokens and words is one-to-many, they are a rich source of information for content characterization. Using a vector space classifier with a scanned document image database, we show that the word shape token-based approach is quite adequate for content-oriented categorization in terms of accuracy compared with conventional OCR-based approaches.
منابع مشابه
Text Categorization of Low Quality Images
Categorization of text images into content oriented classes would be a useful capability in a variety of document handling systems Many methods can be used to cat egorize texts once their words are known but OCR can garble a large proportion of words particularly when low quality images are used Despite this we show for one data set that fax quality images can be cat egorized with nearly the sa...
متن کاملText categorization using character shape codes
Text categorization in the form of topic identification is a capability of current interest. This paper is concerned with categorization of electronic document images. Previous work on the categorization of document images has relied on Optical Character Recognition (OCR) to provide the transformation between the image domain and a domain where pattern recognition techniques are more readily ap...
متن کاملAn Efficient Way of Communication without Posting Undesiered Messages in Online Social Network
To filter out violant, vulgar messages and unpleasant images on users own personal wall page by providing ability to control the messages and images. To evaluate a flexible system is called Filtered Wall, which has ability to filter undesired messages. With the use of Machine learning text categorization techniques to automatically assign each short text message based on its content. In order t...
متن کاملObject-Oriented Method for Automatic Extraction of Road from High Resolution Satellite Images
As the information carried in a high spatial resolution image is not represented by single pixels but by meaningful image objects, which include the association of multiple pixels and their mutual relations, the object based method has become one of the most commonly used strategies for the processing of high resolution imagery. This processing comprises two fundamental and critical steps towar...
متن کاملApply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...
متن کامل